 pre-training bert


FinBERT-QA: Financial Question Answering with pre-trained BERT Language Models

arXiv.org Artificial Intelligence

Motivated by the emerging demand in the financial industry for the automatic analysis of unstructured and structured data at scale, Question Answering (QA) systems can provide lucrative and competitive advantages to companies by facilitating the decision making of financial advisers. Consequently, we propose a novel financial QA system using the transformer-based pre-trained BERT language model to address the limitations of data scarcity and language specificity in the financial domain. Our system focuses on financial non-factoid answer selection, which retrieves a set of passage-level texts and selects the most relevant as the answer. To increase efficiency, we formulate the answer selection task as a re-ranking problem, in which our system consists of an Answer Retriever using BM25, a simple information retrieval approach, to first return a list of candidate answers, and an Answer Re-ranker built with variants of pre-trained BERT language models to re-rank and select the most relevant answers. We investigate various learning, further pre-training, and fine-tuning approaches for BERT. Our experiments suggest that FinBERT-QA, a model built from applying the Transfer and Adapt further fine-tuning and pointwise learning approach, is the most effective, improving the state-of-the-art results of task 2 of the FiQA dataset by 16% on MRR, 17% on NDCG, and 21% on Precision@1.
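The retrieve-then-re-rank formulation described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `BM25` class is a small hand-written Okapi BM25 with commonly used default parameters (`k1=1.5`, `b=0.75`, both assumptions), and `overlap_scorer` is a hypothetical stand-in for the fine-tuned BERT relevance scorer, which is not reproduced here. The toy corpus and query are invented for the example.

```python
import math
from collections import Counter
from typing import Callable, List


def tokenize(text: str) -> List[str]:
    # Naive lowercase whitespace tokenizer (assumption; real systems use an analyzer).
    return text.lower().split()


class BM25:
    """Minimal Okapi BM25 scorer over a small in-memory corpus."""

    def __init__(self, docs: List[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.n = len(self.docs)
        self.avgdl = sum(len(d) for d in self.docs) / self.n
        # Document frequency: number of documents containing each term.
        self.df = Counter(t for d in self.docs for t in set(d))

    def idf(self, term: str) -> float:
        df = self.df.get(term, 0)
        return math.log(1 + (self.n - df + 0.5) / (df + 0.5))

    def score(self, query: str, idx: int) -> float:
        doc = self.docs[idx]
        tf = Counter(doc)
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * len(doc) / self.avgdl)
            total += self.idf(term) * f * (self.k1 + 1) / norm
        return total

    def top_k(self, query: str, k: int) -> List[int]:
        # Stage 1 (Answer Retriever): return indices of the k best BM25 matches.
        scored = [(i, self.score(query, i)) for i in range(self.n)]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return [i for i, _ in scored[:k]]


def rerank(query: str, candidates: List[int], docs: List[str],
           scorer: Callable[[str, str], float]) -> List[int]:
    # Stage 2 (Answer Re-ranker): score each (query, answer) pair independently
    # (pointwise) and sort the candidates by that score.
    return sorted(candidates, key=lambda i: scorer(query, docs[i]), reverse=True)


def overlap_scorer(query: str, answer: str) -> float:
    # Hypothetical placeholder for a fine-tuned BERT relevance score.
    q, a = set(tokenize(query)), set(tokenize(answer))
    return len(q & a) / (len(a) or 1)


docs = [
    "an index fund tracks a market index",
    "bonds pay fixed interest until maturity",
    "index funds usually charge low fees",
    "the weather today looks sunny",
]
query = "what is an index fund"

candidates = BM25(docs).top_k(query, k=2)   # cheap first pass over the corpus
ranked = rerank(query, candidates, docs, overlap_scorer)  # costly model on few items
print(candidates, ranked)
```

The design point is that the expensive scorer only sees the short candidate list returned by BM25, which is where the efficiency gain mentioned in the abstract comes from.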


FaBERT: Pre-training BERT on Persian Blogs

arXiv.org Artificial Intelligence

We introduce FaBERT, a Persian BERT-base model pre-trained on the HmBlogs corpus, encompassing both informal and formal Persian texts. FaBERT is designed to excel in traditional Natural Language Understanding (NLU) tasks, addressing the intricacies of diverse sentence structures and linguistic styles prevalent in the Persian language. In our comprehensive evaluation of FaBERT on 12 datasets in various downstream tasks, encompassing Sentiment Analysis (SA), Named Entity Recognition (NER), Natural Language Inference (NLI), Question Answering (QA), and Question Paraphrasing (QP), it consistently demonstrated improved performance, all achieved within a compact model size. The findings highlight the importance of utilizing diverse and cleaned corpora, such as HmBlogs, to enhance the performance of language models like BERT in Persian Natural Language Processing (NLP) applications. FaBERT is openly accessible at https://huggingface.co/sbunlp/fabert


The Effects of In-domain Corpus Size on pre-training BERT

arXiv.org Artificial Intelligence

Pre-training Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018) and its variants (Liu et al., 2019; Yang et al., 2019; Lan et al., 2019) has proven to be an excellent strategy and achieved state-of-the-art results on many downstream natural language processing (NLP) tasks. Most models focused their pre-training efforts on general domain text. For example, the original BERT model was trained on Wikipedia and the BookCorpus (Zhu et al., 2015). Many other following efforts focused on adding additional texts to the pre-training process to create even larger models with the intent of improving model performance (Liu et al., 2019; Raffel et al., 2019). However, recent works have ...

Web scraping is one oft-cited method used to gather publicly available documents to increase one's in-domain training corpora. For example, LEGAL-BERT (Chalkidis et al., 2020) authors scraped publicly available legal text from six different sources, to achieve a total corpus size of 12 GB. Nevertheless, this data collection process is laborious and time-consuming and could discourage researchers from conducting such experiments for fear of being unable to collect enough data. On the other hand, it would also be a waste of resources if, after all the data is collected, it turns out the data is still not enough for pre-training and the model ends up having poor performance.


Pre-Training BERT on Arabic Tweets: Practical Considerations

arXiv.org Artificial Intelligence

Pretraining Bidirectional Encoder Representations from Transformers (BERT) for downstream NLP tasks is a non-trivial task. We pretrained 5 BERT models that differ in the size of their training sets, mixture of formal and informal Arabic, and linguistic preprocessing. All are intended to support Arabic dialects and social media. The experiments highlight the centrality of data diversity and the efficacy of linguistically aware segmentation. They also highlight that more data or more training steps do not necessarily yield better models. Our new models achieve new state-of-the-art results on several downstream tasks. The resulting models are released to the community under the name QARiB.


Pre-training BERT from scratch with cloud TPU

#artificialintelligence

In this experiment, we will pre-train BERT, a state-of-the-art Natural Language Understanding model, on arbitrary text data using Google Cloud infrastructure. This is useful if a pre-trained model for your language or use case is not available in open source. This guide is intended for NLP researchers who are excited about BERT but are not satisfied with the performance of the available open-source models. For persistent storage of training data and models, you will need a Google Cloud Storage bucket.